{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Evaluating Robust Models\n",
"\n",
"The goal of this notebook is to show how to compare several methods across several datasets.This will also serve as inroduction to two important `scikit-clean` functions: `load_data` and `compare`. \n",
"\n",
"We'll (roughly) implement the core idea of 3 existing papers on robust classification in the presence of label noise, and see how they compare on our 4 datasets readily available in `scikit-clean`. Those papers are:\n",
"\n",
"1. Forest-type Regression with General Losses and Robust Forest - ICML'17 (`RobustForest` below in `MODELS` dictionary)\n",
"2. An Ensemble Generation Method Based on Instance Hardness- IJCNN'18 (`EGIH`)\n",
"3. Classification with label noise- a Markov chain sampling framework - ECML-PKDD'18 (`MCS`)"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [],
"source": [
"import os\n",
"\n",
"import numpy as np\n",
"import pandas as pd\n",
"from sklearn.tree import DecisionTreeClassifier\n",
"from sklearn.neighbors import KNeighborsClassifier\n",
"from sklearn.linear_model import LogisticRegression\n",
"from sklearn.svm import SVC\n",
"from sklearn.neural_network import MLPClassifier\n",
"from sklearn.model_selection import StratifiedKFold\n",
"from sklearn.metrics import accuracy_score, make_scorer\n",
"\n",
"from skclean.detectors import KDN, InstanceHardness, MCS\n",
"from skclean.handlers import WeightedBagging, SampleWeight, Filter\n",
"from skclean.models import RobustForest\n",
"from skclean.pipeline import Pipeline, make_pipeline\n",
"from skclean.utils import load_data, compare\n",
"\n",
"import seaborn as sns"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We'll use 4 datasets here, all come preloaded with `scikit-clean`. If you want to load new datasets through this function, put the csv-formatted dataset file in `datasets` folder (use `os.path.dirname(skclean.datasets.__file__)` to get it's location). Make sure labels are at the last column, and features are all real numbers. Check source code of `load_data` for more details."
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [],
"source": [
"DATASETS = ['iris', 'breast_cancer', 'optdigits', 'spambase']\n",
"SEED = 42 # For reproducibility\n",
"N_JOBS = 8 # No of cpu cores to use in parallel\n",
"CV = StratifiedKFold(n_splits=5, shuffle=True, random_state=SEED+1) \n",
"SCORING = 'accuracy'"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [],
"source": [
"MODELS = {\n",
" 'RobustForest': RobustForest(n_estimators=100),\n",
" 'EGIH':make_pipeline(KDN(), WeightedBagging()),\n",
" 'MCS': make_pipeline(MCS(), SampleWeight(LogisticRegression()))\n",
"}"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We'll create 30% uniform label noise for all our datasets using `UniformNoise`. Note that we're treating noise simulation as data transformation step and attaching it before our models in a pipeline. In this way, noise will only impact training set, and testing will be performed on clean labels."
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [],
"source": [
"from skclean.simulate_noise import UniformNoise\n",
"\n",
"N_MODELS = {}\n",
"for name, clf in MODELS.items():\n",
" N_MODELS[name] = make_pipeline(UniformNoise(.3), clf)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"`scikit-clean` models are compatible with `scikit-learn` API. So for evaluation, we'll use `cross_val_score` function of scikit-learn- this will create multiple train/test according to the `CV` variable we defined at the beginning, and compute performance. It also allows easily parallelizing the code using `n_jobs`."
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"iris, (150, 4), 3, 1.000\n",
"\n",
"iris, RobustForest: 0.8067 in 0.94 secs\n",
"iris, EGIH: 0.9133 in 0.73 secs\n",
"iris, MCS: 0.7067 in 0.11 secs\n",
"\n",
"breast_cancer, (569, 30), 2, 0.594\n",
"\n",
"breast_cancer, RobustForest: 0.8664 in 0.22 secs\n",
"breast_cancer, EGIH: 0.8981 in 0.82 secs\n",
"breast_cancer, MCS: 0.9367 in 0.11 secs\n",
"\n",
"optdigits, (5620, 64), 10, 0.969\n",
"\n",
"optdigits, RobustForest: 0.9402 in 1.48 secs\n",
"optdigits, EGIH: 0.9649 in 8.03 secs\n",
"optdigits, MCS: 0.9584 in 6.36 secs\n",
"\n",
"spambase, (4601, 57), 2, 0.650\n",
"\n",
"spambase, RobustForest: 0.7857 in 1.12 secs\n",
"spambase, EGIH: 0.8581 in 7.18 secs\n",
"spambase, MCS: 0.8303 in 0.45 secs\n",
"\n"
]
}
],
"source": [
"from time import perf_counter # Wall time\n",
"from sklearn.model_selection import cross_val_score\n",
"\n",
"for data_name in DATASETS:\n",
" X,y = load_data(data_name, stats=True) \n",
" \n",
" for clf_name, clf in N_MODELS.items():\n",
" start_at = perf_counter()\n",
" r = cross_val_score(clf, X, y, cv=CV, n_jobs=N_JOBS, scoring=SCORING).mean()\n",
" print(f\"{data_name}, {clf_name}: {r:.4f} in {perf_counter()-start_at:.2f} secs\")\n",
" print()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The `compare` function does basically the same thing the above cell does. Plus, it stores the results in a CSV format, with datasets in rows and algorithms in columns. And it can also automatically resume after interruption."
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Wall time: 24.5 s\n"
]
},
{
"data": {
"text/html": [
"